ABSTRACT

We apply machine learning in the form of a nearest neighbor instance-based algorithm (NN) to 
generate full photometric redshift probability density functions (PDFs) for objects in the Fifth Data 
Release of the Sloan Digital Sky Survey (SDSS DR5). We use a conceptually simple but novel 
application of NN to generate the PDFs — perturbing the object colors by their measurement error — 
and using the resulting instances of nearest neighbor distributions to generate numerous individual 
redshifts. When the redshifts are compared to existing SDSS spectroscopic data, we find that the mean 
value of each PDF has a dispersion between the photometric and spectroscopic redshift consistent with 
other machine learning techniques, being σ = 0.0207 ± 0.0001 for main sample galaxies to r < 17.77
mag, σ = 0.0243 ± 0.0002 for luminous red galaxies to r < 19.2 mag, and σ = 0.343 ± 0.005 for
quasars to i < 20.3 mag. The PDFs allow the selection of subsets with improved statistics. For 
quasars, the improvement is dramatic: for those with a single peak in their probability distribution, 
the dispersion is reduced from 0.343 to σ = 0.117 ± 0.010, and the photometric redshift is within 0.3 of
the spectroscopic redshift for 99.3 ± 0.1% of the objects. Thus, for this optical quasar sample, we can 
virtually eliminate 'catastrophic' photometric redshift estimates. In addition to the SDSS sample, we 
incorporate ultraviolet photometry from the Third Data Release of the Galaxy Evolution Explorer All- 
Sky Imaging Survey (GALEX AIS GR3) to create PDFs for objects seen in both surveys. For quasars, 
the increased coverage of the observed frame UV of the SED results in significant improvement over 
the full SDSS sample, with σ = 0.234 ± 0.010. We demonstrate that this improvement is genuine and
not an artifact of the SDSS-GALEX matching process. 

Subject headings: methods: data analysis — catalogs — quasars: general — cosmology: miscellaneous 



1. INTRODUCTION 

Advances in CCD and other technologies are enabling modern wide-field surveys to provide
high quality photometry for ever-increasing numbers of astronomical objects (e.g., Kron 1995;
Reshetnikov 2005; Lawrence et al. 2007). Comparable advances in multifiber spectrographs
are enabling similarly increasing numbers of spectra to be taken (e.g., Lahav & Suto 2004;
Yip 2007). However,
due to the increased integration time required to obtain 
a meaningful spectrum to a given depth compared to an 
image, the number of spectra available typically lags the 
number of images by more than an order of magnitude. 
Given the importance of the physical information con- 
tained within a spectrum compared to that within an 
image, for example, much more accurate diagnostics of 
an object's type and its redshift, any comparable infor- 
mation that can be obtained from the image is of great 
importance. With the much larger numbers of objects 
for which photometry is available, for applications that 
do not require high resolution spectra this information 
can even surpass the spectra in terms of statistical significance (e.g., Blake & Bridle 2005).

However, for many applications, it is vitally important 
to know not only the photometric information, but also 
its relative accuracy within the dataset for each object.
Typically, this might be achieved by, for example, provid- 
ing an error on a measured magnitude, or an estimated 
Gaussian dispersion on a photometric redshift. In gen- 
eral, however, one would like to utilize the full probability 
density function (PDF) within analyses, so that one can 
exclude objects which do not meet specific criteria, or 
fold the information into the analysis. 

An area in which PDFs are of particular utility is 
photometric redshifts. For many purposes, provided 
that they are reasonably accurate, the final raw accu- 
racy of a redshift estimate is not vitally important, pro- 
vided that the error distribution is well known. Pho- 
tometric redshifts, particularly those of quasars, are 
known to suffer from a percentage of 'catastrophic' failures, e.g., Budavári et al. (2001);
Richards et al. (2001); Weinstein et al. (2004); Wu et al. (2004), in which the
derived value is very different from the true value, e.g.,
z ~ 0.7 instead of z ~ 2.2. PDFs can help to minimize
these because in many cases the PDF for such an object 
will contain two or more peaks in the redshift probability 
function. 

Previous work on PDFs for photometric redshifts 
has concentrated on their derivation using a color- 
redshift relation, derived either empirically or from 
spectroscopic templates. Examples in which PDFs 
or distributions for objects are shown include
Lanzetta et al. (1996); Fernández-Soto et al. (1999);
Kodama et al. (1999); Benítez (2000); Bolzonella et al.
(2000); Firth et al. (2003); and Brodwin et al. (2006)
for galaxies, and Budavári et al. (2001); Richards et al.
(2001); Weinstein et al. (2004); and Wu et al. (2004) for
quasars.

In this paper, we utilize objects with spectra from
the Fifth Data Release (DR5; Adelman-McCarthy et al.
2007) of the Sloan Digital Sky Survey (SDSS; York et al.
2000) to train a nearest-neighbor instance-based machine
learning algorithm, and perform blind tests to assess the
utility of the method in assigning PDFs. We present
results for main sample galaxies (MSGs; Strauss et al.
2002), luminous red galaxies (LRGs; Eisenstein et al.
2001), and quasars (Richards et al. 2002). These
samples have successively lower sample densities, but
probe larger cosmic volumes. With our approach, it
is also possible to generate PDFs for the entire SDSS
photometric database. Similar work was carried out for
classification probabilities by Ball et al. (2006) for the
quantity P(star, galaxy, neither-star-nor-galaxy). However,
such an effort is beyond the scope of the current
paper, as we work solely with objects for which spectra
are available.

In addition to the SDSS, we cross-match the SDSS
data to the Third Data Release (GR3; Morrissey et al.
2007) of the Galaxy Evolution Explorer All-Sky Imaging
Survey (GALEX AIS; Martin et al. 2005). This provides
an additional two bands in the near- and far-UV, giving
useful information by extending the SED coverage, e.g.,
so that at z < 0.3 we can potentially sample information
for quasars from both the Mg II line in the UV and Hα
in the optical.

2. DATA 

We utilize data from the SDSS DR5 and the GALEX 
AIS GR3. In the SDSS, we select primary non-repeat 
observations of objects with spectra classified as galaxies
and quasars (specClass = galaxy, qso or hiz_qso)
in the specObj view of the Catalog Archive Server
(CAS). In GALEX we select photometric objects with
primary_flag = 1 from its similar CAS interface. All
object attributes and errors used are from these sources. 
The data are retrieved via SQL queries. 

In the SDSS, the object attributes retrieved are the
magnitudes, ugriz, the associated errors derived from
photon statistics (Stoughton et al. 2002), and the spectral
type. The SDSS imaging covers the wavelength
range 3000 Å - 10,500 Å, and the spectra 3800 Å - 9200 Å.
Each magnitude is measured in four different apertures:
PSF, fiber, Petrosian, and model; and we require all magnitudes
to be within the range 0 < mag < 40, and magnitude
errors to be within 0 < magErr < 10. Much
tighter cuts could reasonably be applied, but we simply
wish to eliminate extreme outlying values that are
entirely unphysical (e.g., -9999), as they can cause instability
in the learning algorithm. We also note that
less outlying values should be easily accounted for by
the learning process. Throughout this work, the SDSS
magnitudes are corrected for Galactic extinction using
the dust maps of Schlegel et al. (1998), and the GALEX
magnitudes are corrected using the E(B − V) (e_bv) term
inferred from these maps using the standard formula of
Cardelli et al. (1989).

We subdivide the objects into three samples: main
sample galaxies, luminous red galaxies, and quasars.
Each sample is subject to additional cuts as appropriate.
For MSGs (specClass = 2), we require, following
Strauss et al. (2002), Petrosian magnitude r < 17.77,
zWarning = 0, zStatus > 2, and zConf > 0.85. For
LRGs, we apply the selection criteria of Eisenstein et al.
(2001), resulting in specClass = 2, primTarget =
TARGET_GALAXY_RED, z > 0.2, zWarning = 0, zStatus >
2, and zConf > 0.85. For quasars, we require
specClass = 3 or 4, zWarning = 0, and zStatus > 2.
We remind the reader that extinction-corrected magnitudes
are used throughout. The resulting numbers of
objects are 413,361 MSGs, 66,268 LRGs, and 55,743
quasars. The quasar sample is the same as that of
Ball et al. (2007, hereafter B07), with the loss of three
objects due to our additional restriction on the magnitude
error.

The SDSS samples are cross-matched to the primary 
photometric objects in the photoObjAll view of the 
GALEX AIS GR3 database, using an RA+decl. tolerance
(a distance on the sky) of 4 arcsec. This adds the
near-UV (1750 Å - 2750 Å) and far-UV (1350 Å - 1750 Å)
bands. We require the match to be unambiguous, in the 
sense that no SDSS object is within 4 arcsec of more 
than one GALEX object. Figure 1 shows histograms of
the object separations. The majority of these are much 
smaller than 4 arcsec, indicating that our tolerance is 
reasonable. For the GALEX photometry, we construct 
samples requiring a detection in both the near- and far-UV
bands (band = 3), and a second set of samples just
requiring detection in the near-UV band (band = 1).
The latter are constructed because the samples are con- 
siderably larger, but still incorporate some UV informa- 
tion. We again require magnitudes in the range 0-40, 
and the flags fuv_artifact and nuv_artifact are re- 
quired to be 0. The resulting matches consist of 59,845, 
256, and 10,328 objects for MSGs, LRGs, and quasars 
respectively in near- and far-UV; and 100,826, 2316, and 
17,110 objects for near-UV only. Several qualitative fea- 
tures of these matches are as expected: (1) the near- 
UV-only samples are larger because they only require 
one detection not two; (2) there are very few matches 
to LRGs in GALEX, because LRGs are dominated by 
massive red early-type galaxies with little ongoing star 
formation; (3) the size of the quasar matched far- and 
near-UV sample increases in the same proportion to the 
size of the GALEX dataset as a whole compared to that 
obtained for GALEX GR2 by B07; and (4) there are few 
quasar matches beyond z ≈ 2, because the Lyman limit
at 912 Å has then redshifted through the GALEX bands
(912 Å × 3 ≈ 2740 Å at z = 2, the red edge of the near-UV band).
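
The matching rule described above (a 4 arcsec tolerance, with ambiguous cases rejected) can be sketched as follows. This is a minimal illustration rather than the query actually run against the CAS: the function names are hypothetical, positions are assumed to be (RA, decl.) pairs in degrees, and the brute-force loop would need a spatial index for catalogs of realistic size.

```python
import math

ARCSEC_IN_DEG = 1.0 / 3600.0


def separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def unambiguous_matches(sdss, galex, tol_arcsec=4.0):
    """Map SDSS index -> GALEX index for SDSS objects that have exactly
    one GALEX object within the tolerance (hypothetical helper)."""
    tol = tol_arcsec * ARCSEC_IN_DEG
    matches = {}
    for i, (ra_s, dec_s) in enumerate(sdss):
        near = [j for j, (ra_g, dec_g) in enumerate(galex)
                if separation_deg(ra_s, dec_s, ra_g, dec_g) <= tol]
        if len(near) == 1:  # zero or several neighbors: no match kept
            matches[i] = near[0]
    return matches
```

Objects with no GALEX neighbor and objects with more than one are both dropped, which is one reading of the requirement that the match be unambiguous.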

Following B07, in addition to the SDSS and
SDSS+GALEX datasets, we also analyze the
SDSS+GALEX sample of objects, but using only
SDSS features. As in B07, these datasets are referred to
as GALEX-SDSS-only, and they enable us to quantify
the level of improvement in SDSS+GALEX seen from
the addition of the GALEX UV features, as opposed to
possible improvement due to the sample only containing
luminous quasars that appear in both SDSS and
GALEX.

3. ALGORITHMS 

We apply the NN instance-based learning to each of
the datasets. Full details of the NN method and its extension
to k-nearest neighbors (kNN) are given in B07,
or in, e.g., Aha et al. (1991); Witten & Frank (2000); or
Hastie et al. (2001). Briefly, the method requires a set
of training features for each object and a target property.
The algorithm then compares the position in feature
space of each new object in the testing set to the
training set, and assigns the target property of the nearest
training set object. This may be generalized to a
weighted sum of nearest neighbors, i.e., k > 1. The
method is computationally intensive^4, but we are able
to exploit its full power by utilizing nationally allocated,
peer-reviewed time on the Xeon Linux cluster Tungsten
at the National Center for Supercomputing Applications
(NCSA), and the Java environment Data-to-Knowledge
(Welge et al. 1999).

Fig. 1. — Histograms of separations in SDSS-GALEX cross-matches. The rows show main sample galaxies, luminous red galaxies, and
quasars respectively. The left-hand column shows FUV+NUV, the right-hand column shows NUV-only.

Throughout this paper, for the SDSS data the training
features are the 4 colors u − g, g − r, r − i, and
i − z in the four magnitude types PSF, fiber, Petrosian,
and model. This results in 16 training features. In B07, 
genetic algorithms were used to investigate subsets of 
these parameters in a systematic way; however, no sub- 
set was found that resulted in significant improvement, 
and indeed many subsets were worse. Preprocessing the 
training features with principal component analysis may 
remove some redundancy and save computational time, 
but given the aforementioned B07 result, we elected to 
simply use the full 16 colors throughout, in the spirit of 
using the full information available. 

The target property is the spectroscopic redshift, which 
we regard as being correct as any error on this value is 
expected to be small compared to the photometric redshift
error. When cross-matched to the GALEX data,
the addition of the far-UV (FUV) and near-UV (NUV)
bands gives the additional colors FUV−NUV, and four instances
of NUV−u, resulting in 21 training features. The
GALEX-SDSS-only sets contain the same objects as the 
SDSS+GALEX, but with just the assigned 16 SDSS fea- 
tures used. 
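
The feature counts quoted above (16 for SDSS, 21 for SDSS+GALEX) follow from simple bookkeeping, sketched here. The dictionary layout and function names are hypothetical, and the reading that the four NUV−u colors correspond to the four SDSS magnitude types is an assumption based on the stated total of 21.

```python
BANDS = ("u", "g", "r", "i", "z")
MAG_TYPES = ("psf", "fiber", "petro", "model")  # hypothetical keys


def sdss_colors(mags):
    """mags: {mag_type: {band: magnitude}} -> 16 adjacent-band colors."""
    feats = []
    for mag_type in MAG_TYPES:
        for blue, red in zip(BANDS[:-1], BANDS[1:]):
            feats.append(mags[mag_type][blue] - mags[mag_type][red])
    return feats


def sdss_galex_colors(mags, fuv, nuv):
    """16 SDSS colors, then FUV-NUV, then NUV-u for each magnitude type
    (assumed interpretation), giving 21 features."""
    feats = sdss_colors(mags)
    feats.append(fuv - nuv)
    feats.extend(nuv - mags[mag_type]["u"] for mag_type in MAG_TYPES)
    return feats
```

The GALEX-SDSS-only datasets would then correspond to calling `sdss_colors` on the cross-matched objects.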

As in B07, we standardize the training features such
that each has a mean of 0 and a variance of 1. We test the
performance of the algorithm by splitting each dataset
into a training set, consisting of an 80% random subsample
of the data, and a blind test set, consisting of
the remaining 20%. The blind test set does not overlap
with the training set, and this represents a realistic
measure of the performance of the algorithm on unseen
data within the same color regime. We perform 10-fold
repeated holdout validation on each dataset, i.e., 10 different
training:blind test splits, and quote the mean and
standard deviation of the results for each.

4 For example, generating the MSG PDFs requires 172048 s on
100 nodes for n_PDF × n_galaxy × n_validation = 100 × 82672 × 10 galaxies,
giving 4.8 galaxies s^-1 per node.
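
A minimal sketch of the standardization and the repeated 80:20 holdout is given below. Both function names are hypothetical, and computing the scaling from the training set alone is an assumption: the text does not say which objects define the mean and variance.

```python
import random
import statistics


def standardize(train, test):
    """Scale every feature to zero mean and unit variance, using
    training-set statistics (an assumption of this sketch)."""
    cols = list(zip(*train))
    means = [statistics.fmean(c) for c in cols]
    # A constant feature has zero spread; leave it unscaled.
    sds = [statistics.pstdev(c) or 1.0 for c in cols]

    def scale(rows):
        return [[(x - m) / s for x, m, s in zip(row, means, sds)]
                for row in rows]

    return scale(train), scale(test)


def repeated_holdout(n_objects, n_repeats=10, train_frac=0.8, seed=0):
    """Yield (train_idx, test_idx) for independent random splits."""
    rng = random.Random(seed)
    n_train = int(round(train_frac * n_objects))
    for _ in range(n_repeats):
        idx = list(range(n_objects))
        rng.shuffle(idx)
        yield idx[:n_train], idx[n_train:]
```

The quoted mean and standard deviation of each statistic would then be taken over the values obtained from the ten splits.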

The NN method as described produces a single scalar-valued
photometric redshift for each object (for each
training:blind split). Therefore, to generate a probabil- 
ity density function in redshift for each object, we perturb 
the values of the training and blind features according to 
the given error on each feature. We assume the errors are 
Gaussian, and the perturbation is applied to the mag- 
nitude before the color is derived. This assumes that 
there is negligible covariance between the magnitudes.
Scranton et al. (2005) show that in fact this is not necessarily
the case with the available SDSS magnitude errors.
However, these errors are underestimated, which cancels
the effect of the covariance at the 10-20% level. We therefore use
the values supplied in the SDSS DR5 for the work pre- 
sented here. For each MSG we apply 10 perturbations to 
the features in the training set and 10 to the blind test 
set, giving 100 photometric redshift values per galaxy. 
For LRGs and quasars, we similarly apply 32 perturba- 
tions, giving 1024 values per object. The same numbers 
of perturbations are used for SDSS, SDSS+GALEX, and 
GALEX-SDSS-only. 
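
The perturbation scheme can be illustrated with a deliberately small sketch. It assumes one magnitude type and Euclidean distance, omits the standardization step, and crosses every perturbed copy of the training set with every perturbed copy of the test object to obtain the 10 × 10 (or 32 × 32) redshifts. All names are hypothetical and this is not the D2K implementation used for the results.

```python
import math
import random


def colors(mags):
    """Adjacent-band colors from a list of magnitudes."""
    return [a - b for a, b in zip(mags[:-1], mags[1:])]


def perturb(mags, errs, rng):
    """One Gaussian realization of the magnitudes, drawn independently
    per band (i.e., neglecting covariance, as in the text)."""
    return [rng.gauss(m, e) for m, e in zip(mags, errs)]


def nearest_redshift(query, train_feats, train_z):
    """Redshift of the training object closest in feature space."""
    best = min(range(len(train_feats)),
               key=lambda i: math.dist(query, train_feats[i]))
    return train_z[best]


def redshift_samples(obj, train, n_train_pert=10, n_test_pert=10, seed=0):
    """obj: (mags, errs); train: list of (mags, errs, z_spec).
    Returns n_train_pert * n_test_pert photometric redshifts."""
    rng = random.Random(seed)
    train_z = [z for _, _, z in train]
    train_sets = [[colors(perturb(m, e, rng)) for m, e, _ in train]
                  for _ in range(n_train_pert)]
    queries = [colors(perturb(obj[0], obj[1], rng))
               for _ in range(n_test_pert)]
    return [nearest_redshift(q, feats, train_z)
            for feats in train_sets for q in queries]
```

The perturbation is applied to the magnitudes and the colors are formed afterwards, matching the order described above. The returned list is the raw material for the PDF of a single object.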

We analyze the PDFs in terms of peaks in the probability.
We establish these peaks by fitting a piecewise
polynomial spline to the binned redshift counts. Peak
redshifts are defined as the value at which half the integrated
area lies under a peak. We set a threshold under
which the area does not count, so that very low peaks
are not identified. The threshold level is that which the
PDF would be at if it were completely flat. Thus peaks
above this represent excess probability, and an object is
essentially guaranteed to have at least one redshift peak.
The area under a peak but also under the threshold is
not included in the integral, so that the redshift values
of the peaks are not pulled towards the peak centers by
probability that would not otherwise count. Figures 2-5
show typical examples of object PDFs.

Fig. 2. — Example PDF for an SDSS DR5 main sample galaxy.
The black line is the PDF, a spline fit to the binned redshifts for
each object (100 for main sample galaxies, 1024 otherwise); the red
cross is the peak redshift, corresponding to half of the peak area;
the blue dots are the binned individual redshifts; the horizontal
dashed line is the peak threshold, and the shaded areas are the
PDF peaks.

Fig. 3. — As Figure 2 but for a luminous red galaxy.

Fig. 4. — As Figure 2 but for a quasar with one peak redshift.
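
The peak definition can be made concrete with a simplified sketch that works directly on the histogram rather than on a spline fit (a swapped-in simplification, not the procedure used for the results). The threshold is the level of a flat PDF, only the excess above it is integrated, and each peak redshift is the point that encloses half of that excess area. The function name and binning arguments are hypothetical, and all samples are assumed to lie in [z_min, z_max].

```python
def find_peaks(samples, z_min, z_max, n_bins):
    """Return one peak redshift per contiguous run of bins whose
    counts exceed the flat-PDF level."""
    width = (z_max - z_min) / n_bins
    counts = [0] * n_bins
    for z in samples:
        k = min(int((z - z_min) / width), n_bins - 1)
        counts[k] += 1

    flat = len(samples) / n_bins  # level of a completely flat PDF
    excess = [max(c - flat, 0.0) for c in counts]

    peaks = []
    k = 0
    while k < n_bins:
        if excess[k] == 0.0:
            k += 1
            continue
        start = k
        while k < n_bins and excess[k] > 0.0:
            k += 1
        half = sum(excess[start:k]) / 2
        running = 0.0
        for j in range(start, k):
            if running + excess[j] >= half:
                # Interpolate linearly inside the bin that crosses half.
                frac = (half - running) / excess[j]
                peaks.append(z_min + (j + frac) * width)
                break
            running += excess[j]
    return peaks
```

Selecting objects with `len(find_peaks(...)) == 1` would then correspond to the single-peak subsamples discussed in the results.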

4. RESULTS 

In general, we find the expected result that, for the 
SDSS, main galaxies and luminous red galaxies show 
mainly Gaussian-like PDFs, and quasars give a higher in- 
cidence of catastrophic failures. The addition of GALEX 
data does not greatly affect the galaxies, but substan- 
tially improves the results for quasars. We therefore 
present our galaxy results first, and then concentrate on 
the quasar results, for which we perform additional tests. 

The z_phot versus z_spec dispersions and percentages of
objects within Δz < 0.1, 0.2, and 0.3 are tabulated for
the whole blind test samples in Table 1 and for objects
with a single PDF peak in Table 2. Note that the dispersions
are given as σ throughout and not σ/(1 + z).
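
For reference, the two tabulated statistics can be written down in a few lines. This sketch assumes that σ is the root-mean-square of z_phot − z_spec about zero, without division by (1 + z), and that the percentages use a strict inequality; the function names are hypothetical.

```python
import math


def rms_dispersion(z_phot, z_spec):
    """RMS of (z_phot - z_spec), not normalized by (1 + z)."""
    sq = [(p - s) ** 2 for p, s in zip(z_phot, z_spec)]
    return math.sqrt(sum(sq) / len(sq))


def percent_within(z_phot, z_spec, dz):
    """Percentage of objects with |z_phot - z_spec| < dz."""
    n = sum(abs(p - s) < dz for p, s in zip(z_phot, z_spec))
    return 100.0 * n / len(z_phot)
```

In this notation the objects outside Δz < 0.3 are the 'catastrophic' failures.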

4.1. Galaxies 




Fig. 5. — As Figure 2 but for a quasar with multiple peak redshifts.

Figure 6 shows the photometric redshift, z_phot, versus
spectroscopic redshift, z_spec, for the 82,672 SDSS DR5
MSGs in the blind testing set. For each object shown, the
value of z_phot is the mean of the set of individual redshift
values that make up its PDF. The plot shows only one in- 
stance of the training:blind split from the ten used in the 
holdout validation, as each split divides the sample into 
different subsets. Also, uniquely among the datasets, the 
SDSS MSGs produced a small fraction (0.2%) of objects 
that could not be fit by the spline. These are excluded 
from the plot. We do not expect that the missing ob- 
jects would have any significant impact on the results 
quoted here because (1) the raw PDFs were examined 
visually for these objects and did not appear unusual; 
and (2) earlier results based on direct examination of the 
PDF histogram included these missing objects and did 
not yield a significantly different value of σ.

For DR5 MSGs, we find that the overall RMS dispersion
between z_phot and z_spec, taking into account
the holdout validation, is σ = 0.0207 ± 0.0001. This
is very similar to numerous previous published results,
which all obtain 0.02 ≲ σ < 0.025, e.g., Brunner et al.
(1997), from Galactic fields at high latitude, and, in the
SDSS, Tagliaferri et al. (2002); Csabai et al. (2003);
Firth et al. (2003); Ball et al. (2004); Collister & Lahav
(2004); Vanzella et al. (2004); Wadadekar (2005);
Way & Srivastava (2006); D'Abrusco et al. (2007);
Kurtz et al. (2007); Li et al. (2007); Oyaizu et al.
(2008); Wang et al. (2007, 2008); and Wray & Gunn
Probabilistic Photometric Redshifts 



Fig. 6. — Photometric versus spectroscopic redshift for the 82,672
SDSS DR5 main sample galaxies of the blind testing set (20% of the
sample). z_phot is the mean photoz from the PDF for each object.
The result from a single split (of the ten used for validation) of the
data into training and blind testing data is shown. σ is the RMS
dispersion between z_phot and z_spec.



(2007). Other methods for selecting a redshift from
the PDFs, for example, the mode, median, and the 
same values using the binned data rather than the raw 
redshifts, give very similar results. 

The addition of GALEX data reduces the blind testing
sample size to 11,969 for FUV+NUV and 20,165 for
NUV-only. For NUV-only, the dispersion is the same as
the optical (σ = 0.0209 ± 0.0002), and for FUV+NUV it
is slightly worse, at σ = 0.0231 ± 0.0004. The GALEX-SDSS-only
values also show very similar dispersion, indicating
that GALEX photometry is making little difference
to the spread. In fact, it is not surprising that
the addition of the GALEX bands does little to help,
for both MSG and LRG, because the dominant source of
redshift information in color space, the 4000 Å break, is
always redwards of the GALEX bands.

If one analyzes the z_phot values from a single run of the
NN algorithm, without perturbing the input features and
taking the mean, then the dispersion is higher than that
derived from the mean of the PDF, typically σ ≈ 0.03.
However, the galaxies become much more symmetrically
distributed about the z_phot = z_spec locus. We discuss
possible reasons for this in §5.1.

In the full results, if one selects galaxies with a single
PDF peak, the σ value improves slightly to σ =
0.0198 ± 0.0001, but the values for multiple peaks are significantly
higher. For galaxies with multiple peaks, if one
artificially selects the peak nearest to z_spec, regardless of
its relative height compared to the other peaks (i.e., the
'best peak', an estimate of the best possible photoz prediction),
the resulting dispersions remain similar to the
single-peak value for both SDSS and SDSS+GALEX.

Given that the dispersion for the whole sample is
σ ≈ 0.02, consistent with numerous previous results in
the literature, and given the generally Gaussian nature of the
PDFs and the lack of improvement from either single
peaks or the artificial best peaks, we conclude that the
PDFs for MSGs are approximately optimal, given the
data.

Figure 7 shows the analogous results to Figure 6 for
LRGs. Again the σ value is similar to previous work,
e.g., Padmanabhan et al. (2005); Collister et al. (2007);
D'Abrusco et al. (2007); and Lopes (2007); and the single
and 'best' peaks again make little difference. There is a
kink in the z_phot versus z_spec plot at z ~ 0.35, likely due
to the movement of the 4000 Å break between filters (see
§5.1).

Fig. 7. — As Figure 6 but for 13,254 SDSS DR5 luminous red
galaxies.

Fig. 8. — As Figure 6 but for 11,149 SDSS DR5 quasars.

4.2. Quasars 

4.2.1. SDSS DR5

Figure 8 shows the mean PDF z_phot versus z_spec for
11,149 blind test quasars, in a similar manner to Figures
6 and 7. The result is similar to that obtained
by B07 using our multiple-nearest neighbor approach.
There, we obtained a value of σ = 0.35 (quoted there as
σ² = 0.123 ± 0.002), compared to σ = 0.343 ± 0.005
here.

Here, we are able to improve on this using the information
available in the PDFs. Figure 9 shows the same
data as Figure 8 but for those objects which have a single
PDF peak. Although, averaged over the ten holdout
validation runs, this reduces the sample size from 11,149
to 4339 ± 24, and alters the selection function, the improvement
is dramatic. The dispersion is improved from
σ = 0.343 ± 0.005 to σ = 0.117 ± 0.001. The percentage
of quasars within Δz < 0.1, 0.2, and 0.3 is increased
from 53.8 ± 0.4%, 72.4 ± 0.3%, and 79.8 ± 0.3%^5 to
73.6 ± 0.6%, 96.3 ± 0.1%, and 99.3 ± 0.1%. Just 33 ± 4
objects from 4339 ± 24 (0.7%) remain as catastrophics.
Weinstein et al. (2004) obtained 83% of quasars within

5 Ball et al. (2007) found 54.9 ± 0.7%, 73.3 ± 0.6%, and 80.7 ±
0.3%. This is consistent within the errors.
Fig. 9. — As Figure 6 but for SDSS DR5 quasars with single
PDF peaks. Over the ten validation runs, the number of objects
with one peak from the blind test sample of 11,149 is 4339 ± 24.
The alteration of the selection function, n(z), is clear, but so is the
dramatic improvement in the dispersion of the remaining objects.
99.3% are within Δz = 0.3.




Fig. 10. — Alteration in the selection function for the subsample 
of SDSS DR5 quasars with one peak compared to the full sample. 
The horizontal dashed line shows the overall fraction of quasars 
with single-peaked PDFs, to which the red line would correspond 
if there were no alteration. 

|z_spec − z_phot| < 0.3 for their whole sample, but their
dispersion is much higher (cf. their figure 4 and Figure
9).

Figure 10 shows the alteration in the selection function
if we insist on only one peak in the PDF. The fraction 
of objects with one peak is either significantly decreased 
or increased from the average. Significant deficits occur 
^-f ^ -2 ^ 1 ^^^d 1-9 < z < 2.8, with excess in the re- 
maining redshift ranges. These ranges correspond to the 
an increased dispersion of σ ≈ 0.5. We discuss possible
reasons why these redshift ranges are poor in §5.2.

4.2.2. SDSS DR5 + GALEX GR3 

Figure 11 shows z_phot versus z_spec for the mean of the
PDF in a similar manner to Figures 6, 7, and 8. The sample
size is reduced from 11,149 to 2066, but, as shown
in B07, the addition of the two GALEX bands (FUV
and NUV) substantially improves the results for the remaining
objects. Here, the dispersion is reduced from
the SDSS value of σ = 0.343 ± 0.005 to 0.234 ± 0.011,
and the percentage of non-catastrophics is increased from
79.8 ± 0.3% to 90.8 ± 0.5%. This improvement is achieved
without requiring a single peak in the PDF. When this
requirement is made, the sample size is 1093 ± 24, the
dispersion is further improved to σ = 0.106 ± 0.016, and
99.5 ± 0.2% of the quasars are within Δz < 0.3.

Fig. 11. — Photometric versus spectroscopic redshift for 2066
SDSS DR5 quasars incorporating near- and far-UV photometry
from matching to GALEX GR3. The improvement over SDSS
alone is achieved without requiring single-peaked PDFs.

We also find that, in addition to this FUV+NUV 
match, the NUV-only match produces results which 
are almost as good, and for a sample that is 70% 
larger, at 3422 objects. The dispersion and percent non- 
catastrophics are σ = 0.242 ± 0.009 and 90.8 ± 0.4%.
The results for the 1864 single-peaked objects are σ =
0.109 ± 0.010 and 99.4 ± 0.1%. This is a similar fraction
of objects with a single peak as seen in the SDSS sample, 
with similar statistics. 

An analysis of the GALEX-SDSS-only data confirms
that the improvement, as in B07, is genuinely due to 
the addition of the GALEX data and not simply a 
consequence of requiring the objects to be detected by 
GALEX. The full results are given in Tabled 

5. DISCUSSION 
5.1. Galaxies 

For galaxies, the results are similar to previous work, 
as described in §4.1, except that we are now presenting
PDFs for each galaxy in the form of 100 (for MSG) and 
1024 (for LRG) photometric redshift estimates. These 
PDFs can be used to improve scientific analyses of large 
galaxy samples. 

The overall redshifts from taking the mean values of 
the PDFs show similar trends to previous results, with 
few catastrophic failures and a smoothly decreasing incidence
of objects at increasing |z_spec − z_phot|. The shapes
of the PDFs generally appear Gaussian, with few widely-spaced
peaks. As in previous work, Figure 6 shows
that for MSGs the mean values of z_phot have a slight
bias toward high values at z_spec ≲ 0.1. The single values
of z_phot from the unperturbed sample do not suffer
from this and are symmetrically distributed about the
z_phot = z_spec locus, but the RMS dispersion is higher, at
σ = 0.0284 ± 0.0002.

Similar behavior with respect to symmetry is seen by
Ball et al. (2004) (their figure 1), who show the same
biases in morphological classification, which is largely
driven by the inverse concentration index. Such bias is
not likely due to the lower end of the scale being zero:
Ball (2004) (§3.6.3) showed that it is still present when
the targets are numerically shifted, i.e., that it is the
end of the scale that matters, not its numerical value. A
more likely explanation is that taking the mean is analogous
to using multiple neighbors in the training set.
Ball et al. (2004) found that when the training set for
the eClass eigenclass spectral type (Connolly et al. 1995;
Connolly & Szalay 1999; Yip et al. 2004) is cut so that
it is not dominated numerically by galaxies of the eClass
corresponding to early types, the resulting classifications
were spread more symmetrically about the target locus,
especially in the region of early types. Vanzella et al.
(2004) (their figure 16) show a similar example with
SDSS DR1 galaxies. Likewise, we find here when using
a single neighbor that the types are more symmetrically
distributed, albeit with higher dispersion. Thus
it is possible that using multiple neighbors is subjecting
the results to the inevitable uneven distribution of
objects in parameter space, in this case colors, causing
the same effect as seen in Ball et al. (2004). It appears
to be a generic feature of single nearest-neighbor models
(also seen for quasars) that the single neighbor produces
a roughly symmetrical distribution about z_phot = z_spec,
but using multiple neighbors (as in B07), or, in this case,
taking the mean of the PDF, reduces the dispersion but
introduces structure into the relation. However, in the
latter case the structure is not usually large compared to
the dispersion.

The LRGs, being constrained to be in the redshift
range 0.2 < z_spec < 0.5, appear to suffer less from this
type of bias at low and high redshift, but instead show
a kink at z_spec ~ 0.35. This is likely due to the 4000 Å
break passing between the SDSS g and r bands. Again,
the single photoz values do not show this bias, but the
dispersion is higher at σ = 0.0318 ± 0.0002.

5.2. Quasars 

As described in SJH there is a higher incidence of catas- 
trophic failures among quasars compared to galaxies, 
which results in more PDFs with widely spaced peaks. 
Possible contributory factors include reddening, contam- 
ination of the quasar light by a host galaxy, degener- 
acy in the color-redshift relation, low equivalent-width 
lines, and unusual spectral slopes. But, in particular, 
bright spectral lines dropping between filters, or sim- 
ulating other lines at a different redshift, are a major 
contributor. In Figure I12[ we overplot, on the Zphot 
versus Zgpoc plot, the redshifts at which the five bright- 
est emissi on lines of the composite quasar spectrum of 
IVandcn B erk et all (|2001h cross the SDSS filter edges. 
The lines are, in order of flux: Lya, C IV, C III, Mg II, 
and Ha. There is a very clear correspondence between 
the redshifts at which the emission lines cross the filters, 
with particularly striking examples for Mg II at z ~ 0.4, 
Ha at z ~ 0.25, and Lya at z ~ 2.2. The lower right 
panel overplots the five lines, showing that there is no 
visually significant structure that does not correspond to 
one of these lines. The general pattern is that in between 
the lines, the redshifts are less spread out, and they then 
jump to a new value where the lines cross. It is also no- 
ticeable that most of these discontinuities not only corre- 
spond to a line crossing a filter, but to several lines doing 
so in a small redshift range. All this shows that emission 
lines moving between filters, and the resulting missing 
information or degeneracy, are a likely cause of much of the 
remaining error in optical quasar photometric redshifts. 
This information could be used to develop 'optimal' filter 
sets for quasar photometric redshift estimation with 
future surveys. 
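
The redshift at which a line of rest wavelength λ_rest reaches an observed wavelength λ_obs is z = λ_obs/λ_rest − 1. The sketch below tabulates these crossings for the five lines discussed above. The rest wavelengths are rounded laboratory values; the band boundaries in `FILTER_EDGES` are rough illustrative numbers assumed here, not the filter edges used for Figure 12, and the function name is hypothetical.

```python
# Rest-frame wavelengths in Angstroms (rounded laboratory values).
LINES = {"Lya": 1216.0, "C IV": 1549.0, "C III]": 1909.0,
         "Mg II": 2798.0, "Ha": 6563.0}

# ASSUMPTION: approximate wavelengths (Angstroms) at which adjacent
# SDSS bands hand over. Illustrative only, not the measured filter edges.
FILTER_EDGES = {"u/g": 3900.0, "g/r": 5500.0, "r/i": 6800.0, "i/z": 8200.0}


def crossing_redshifts(lines=LINES, edges=FILTER_EDGES, z_max=6.0):
    """List (line, boundary, z) for every line/boundary crossing with 0 <= z <= z_max."""
    crossings = []
    for line, lam_rest in lines.items():
        for edge, lam_obs in edges.items():
            z = lam_obs / lam_rest - 1.0  # observed = rest * (1 + z)
            if 0.0 <= z <= z_max:
                crossings.append((line, edge, z))
    # Sort by redshift so clusters of nearby crossings are easy to spot.
    return sorted(crossings, key=lambda item: item[2])


if __name__ == "__main__":
    for line, edge, z in crossing_redshifts():
        print(f"{line:7s} crosses {edge} at z = {z:.2f}")
```

With these assumed boundaries, Mg II, Hα, and Lyα cross near z ~ 0.4, 0.25, and 2.2 respectively, in line with the examples quoted in the text.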

For multiply-peaked PDFs, when the mean zphot or 
zphot of the highest peak is not correct, one of the other 
peaks is often close to the correct redshift. It turns out 
that if one artificially selects these peaks, then the results 
are no better than those for single peaked objects 
(see § 4). Nevertheless, given that these objects 
are a smaller subset of the full sample with a different 
selection function, we attempted to improve the proba- 
bility of the correct peak being chosen for the whole sam- 
ple by applying known prior information to the derived 
PDFs. This was in the form of the known redshift and 
magnitude distributions of quasars, and their luminosity 
function constructed using the values of Richards et al. 
(2006). Given the known n(z) of spectroscopic quasars, 
the photometric redshift is more likely to be in a region of 
high n(z), thus weighting the PDF accordingly may 
increase the incidence of the highest peak being the correct 
one. Similarly, a quasar may be assigned a redshift that, 
given its apparent magnitude, would make it unrealisti- 
cally faint or luminous. This can be downweighted by 
applying the magnitude distribution or luminosity func- 
tion. However, we found that the prior information did 
not improve the redshift statistics. This is not especially 
surprising because an empirical training algorithm such 
as the one we are using implicitly takes into account the 
priors by its use of the training set. Thus further weight- 
ing will not add as much information as it would, for 
example, in a template-based method. 
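
The weighting described above amounts to multiplying the binned PDF by a prior evaluated in the same redshift bins and renormalizing. The following is a minimal sketch under that assumption; the function name and the binned representation are illustrative and not taken from the analysis itself.

```python
import numpy as np


def apply_prior(pdf, prior, bin_edges):
    """Weight a binned redshift PDF by a prior and renormalize to unit area.

    pdf, prior : arrays of length n_bins (densities per unit redshift)
    bin_edges  : array of length n_bins + 1
    """
    pdf = np.asarray(pdf, dtype=float)
    prior = np.asarray(prior, dtype=float)
    widths = np.diff(np.asarray(bin_edges, dtype=float))
    weighted = pdf * prior
    area = (weighted * widths).sum()
    if area <= 0.0:
        # The prior excludes every bin the PDF occupies: keep the PDF as is.
        return pdf
    return weighted / area


if __name__ == "__main__":
    edges = np.linspace(0.0, 4.0, 5)         # four bins of unit width
    pdf = np.array([0.4, 0.1, 0.1, 0.4])     # two equally high peaks
    n_of_z = np.array([1.0, 1.0, 1.0, 3.0])  # hypothetical prior favouring high z
    print(apply_prior(pdf, n_of_z, edges))
```

In this toy case the prior breaks the tie between two equal peaks. For an empirical method, the training set already encodes much of this information, which is consistent with the lack of improvement reported above.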

Finally, we investigated the effect of altering the 
threshold above which the area under the PDF is 
counted as part of the redshift peaks. The main effect is 
to alter the relative quality of the subset of objects with 
single peaks. A higher threshold value will produce more 
objects with one peak but the sample is lower quality, 
and vice-versa. The threshold quoted is chosen because 
it corresponds to what a flat PDF would be, but it also 
appears approximately optimal when the percentage of 
objects with one peak is compared to their dispersion or 
Δz. For much higher thresholds, σ increases relatively 
faster than the sample size; for much lower thresholds, 
the dispersion does not decrease as fast as the sample 
size. 
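
One way to picture the thresholding is to count contiguous runs of bins that rise above the level a flat PDF would have. The sketch below does this for a PDF binned on equal-width bins. Treating each contiguous run as one "peak" is an assumption made here for illustration, and the function name is hypothetical.

```python
import numpy as np


def count_peaks(pdf, threshold=None):
    """Count contiguous runs of bins lying strictly above the threshold.

    ASSUMPTION: equal-width bins. If no threshold is given, the height of
    a flat PDF with the same total (the mean bin value) is used.
    """
    pdf = np.asarray(pdf, dtype=float)
    if threshold is None:
        threshold = pdf.sum() / pdf.size
    above = pdf > threshold
    # A run starts where a bin is above threshold and its predecessor is not.
    previous = np.concatenate(([False], above[:-1]))
    return int((above & ~previous).sum())


if __name__ == "__main__":
    single = [0, 0, 4, 6, 0, 0, 0, 0, 0, 0]
    double = [0, 5, 0, 0, 5, 0, 0, 0, 0, 0]
    print(count_peaks(single), count_peaks(double))
```

Raising `threshold` can merge or remove low secondary peaks, so more objects count as single-peaked, which is the trade-off between sample size and quality described above.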

From the investigations presented in this section, we 
conclude that, for SDSS DR5 optical data, it is unlikely 
that the MSG, LRG, or quasar redshifts can be signifi- 
cantly improved without the addition of new data. How- 
ever, we have shown that, for quasars, if one is willing 
to discard the percentage of the objects that have more 
than one PDF peak, the photometric redshifts are signif- 
icantly improved. 

Fig. 12. Redshifted filters overplotted on zphot versus zspec for SDSS DR5 quasars for the five brightest emission lines. There is a 
clear correspondence between the redshifts at which the emission lines cross SDSS filters and the occurrence of structure in the plot. The 
bottom right-hand panel shows all five lines superimposed. Visually, there is no significant structure that does not correspond to one of 
these lines, and often several lines are close in redshift. 

6. CONCLUSIONS 

We apply nearest neighbor machine learning to objects 
with spectra in the Sloan Digital Sky Survey Data 
Release 5 (SDSS DR5). We subdivide the objects into 
413,361 main sample galaxies, 66,268 luminous red galaxies, 
and 55,743 quasars, both in the SDSS, and matched 
to the Galaxy Evolution Explorer All Sky Imaging Survey 
Third Data Release (GALEX AIS GR3). Each sample 
is divided into a training set consisting of 80% of 
the objects, and a blind testing set of the remaining 
20%. The algorithm assigns a full probability density 
function (PDF) in photometric redshift to each object in 
the blind testing set by perturbing the input features de- 
scribing the objects, in this case the colors, according to 
the magnitude errors. For main sample galaxies (MSGs), 
each PDF is formed from 100 photometric redshift val- 
ues, and for luminous red galaxies (LRGs) and quasars, 
1024 values. 
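
The perturbation procedure can be sketched as follows. This is a minimal illustration rather than the code used for the results: it assumes Gaussian color errors, a brute-force Euclidean nearest neighbor search, and a histogram as the PDF estimate, and all function and variable names are hypothetical.

```python
import numpy as np


def nn_redshift_pdf(colors, color_errs, train_colors, train_z,
                    n_draws=100, bins=None, rng=None):
    """Build a photometric redshift PDF for one object.

    colors, color_errs : arrays of length n_colors for the test object
    train_colors       : array (n_train, n_colors) of training-set colors
    train_z            : array (n_train,) of spectroscopic redshifts
    Returns (samples, pdf, bin_edges).
    """
    rng = np.random.default_rng(rng)
    colors = np.asarray(colors, dtype=float)
    # ASSUMPTION: independent Gaussian errors on each color.
    perturbed = rng.normal(colors, color_errs, size=(n_draws, colors.size))
    # Brute-force squared Euclidean distance to every training object.
    d2 = ((perturbed[:, None, :] - train_colors[None, :, :]) ** 2).sum(axis=2)
    # Each perturbed copy takes the redshift of its single nearest neighbor.
    samples = np.asarray(train_z)[d2.argmin(axis=1)]
    if bins is None:
        bins = np.linspace(0.0, 1.0, 51)  # assumed redshift grid
    pdf, bin_edges = np.histogram(samples, bins=bins, density=True)
    return samples, pdf, bin_edges


if __name__ == "__main__":
    demo = np.random.default_rng(0)
    train_colors = demo.normal(size=(500, 4))  # toy u-g, g-r, r-i, i-z
    train_z = demo.uniform(0.0, 0.5, size=500)
    samples, pdf, edges = nn_redshift_pdf(
        train_colors[0], np.full(4, 0.05), train_colors, train_z,
        n_draws=100, bins=np.linspace(0.0, 0.5, 26), rng=1)
    print(samples.mean(), pdf.max())
```

Setting `n_draws` to 100 or 1024 matches the number of redshift values per PDF quoted above. The brute-force search is for clarity only and would be replaced by a tree-based search for samples of this size.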

We use the spectroscopic redshifts to test the utility 
of the method and find that the RMS dispersions between 
the photometric and spectroscopic redshifts are 
σ = 0.0207 ± 0.0001, σ = 0.0243 ± 0.0002, and σ = 
0.343 ± 0.005 for MSGs, LRGs, and quasars, respectively. 
The quoted errors are generated from ten-fold repeated 
holdout validation in the form of ten different 
training-to-blind-testing set splits of the data. Galaxy values are 
similar to previous studies, and quasar values are consistent 
with Ball et al. (2007). Cross-matching to GALEX 
reduces the dispersion for quasars to σ = 0.234 ± 0.011 
for the 10,328 matching objects. The improvement is 
due to the GALEX photometry and not simply an arti- 
fact of requiring the objects to be detected by GALEX. 
It may be possible to improve the galaxy results by in- 
corporating morphological information into the training 
set, for example the inverse concentration index, but the 
improvement would likely be small for these data. 
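
As an illustration of how such a dispersion and its quoted error can be obtained, the sketch below computes the RMS of zphot − zspec on a blind test set and repeats the 80%/20% split ten times. It assumes that the dispersion is the plain RMS of the redshift differences and that the error is the population standard deviation over the splits. `estimator` is a hypothetical callable standing in for the photometric redshift method.

```python
import numpy as np


def rms_dispersion(z_phot, z_spec):
    """RMS of the photometric minus spectroscopic redshift differences."""
    dz = np.asarray(z_phot, dtype=float) - np.asarray(z_spec, dtype=float)
    return float(np.sqrt(np.mean(dz ** 2)))


def repeated_holdout(estimator, features, z_spec,
                     n_repeats=10, test_frac=0.2, seed=0):
    """Mean and standard deviation of the dispersion over random splits.

    estimator(train_features, train_z, test_features) -> z_phot for the test set
    """
    rng = np.random.default_rng(seed)
    n = len(z_spec)
    n_test = int(round(test_frac * n))
    sigmas = []
    for _ in range(n_repeats):
        order = rng.permutation(n)  # a fresh random split each repeat
        test, train = order[:n_test], order[n_test:]
        z_phot = estimator(features[train], z_spec[train], features[test])
        sigmas.append(rms_dispersion(z_phot, z_spec[test]))
    return float(np.mean(sigmas)), float(np.std(sigmas))


if __name__ == "__main__":
    demo = np.random.default_rng(1)
    z = demo.uniform(0.0, 0.3, size=1000)
    feats = (z + demo.normal(0.0, 0.02, size=1000)).reshape(-1, 1)
    # Toy estimator: read the redshift straight off the noisy feature.
    print(repeated_holdout(lambda ftr, ztr, fte: fte[:, 0], feats, z))
```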

For quasars, use of the PDFs enables us to construct 
subsamples which show dramatically improved statistics. 
In particular, selection of objects with a single PDF peak 
in the full SDSS reduces the sample size by two thirds 
but improves the dispersion from σ = 0.343 ± 0.005 to 
σ = 0.117 ± 0.010, with a substantial increase in the 
percentage of non-catastrophic estimates (absolute difference 
between photometric and spectroscopic redshift less than 0.3) 
from 79.8 ± 0.3% to 
99.3 ± 0.1%. The equivalent statistics for the GALEX 
sample are σ = 0.234 ± 0.011 to σ = 0.106 ± 0.016, and 
90.8 ± 0.5% to 99.5 ± 0.2%. The improved samples alter 
the selection function, but there is a good fraction of 
high percentage one-peak-regions over almost the whole 
redshift range. 
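
The non-catastrophic percentage quoted above is the share of objects whose photometric redshift lies within a fixed Δz of the spectroscopic value. A minimal sketch, with a hypothetical function name:

```python
import numpy as np


def percent_within(z_phot, z_spec, dz_max=0.3):
    """Percentage of objects with |z_phot - z_spec| < dz_max."""
    dz = np.abs(np.asarray(z_phot, dtype=float) - np.asarray(z_spec, dtype=float))
    return 100.0 * float(np.mean(dz < dz_max))


if __name__ == "__main__":
    z_spec = [1.0, 1.0, 1.0, 1.0]
    z_phot = [1.1, 0.8, 1.5, 1.29]  # one catastrophic estimate
    print(percent_within(z_phot, z_spec))
```

Applying the same function to the full sample and to the single-peak subset gives the before and after percentages; the thresholds 0.1, 0.2, and 0.3 correspond to the quasar columns of Table 1.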



We attempted weighting the PDFs according to known 
prior information on the distributions of quasars in red- 
shift, apparent magnitude, absolute magnitude from the 
photometric redshift, and absolute magnitude from the 
luminosity function of Richards et al. (2006), but this 
did not improve the statistics. We also derived statistics 
for the PDF peak closest to the spectroscopic redshift 
regardless of its height, and found that these do not im- 
prove on the statistics for quasars with one peak. We 
overplot the redshifts at which bright quasar emission 
lines cross the SDSS filter edges and find that there is 
a clear correspondence with changes in the photometric 
redshift dispersion. This strongly suggests that a large 
fraction of the remaining poor redshifts are caused by 
lines disappearing at filter edges, or simulating other 
lines. We conclude that further improvement requires 
better data in the form of spectra for fainter objects, or 
a larger number of filters, both within and beyond the 
optical range (in particular the UV and IR). 

The NN method is conceptually simple, and, once the 
dataset has been selected, has no adjustable parame- 
ters. This means that, unlike most machine learning 
algorithms, all of the information in the training data 
is used, and all of the computation contributes to the 
final result, rather than exploring parameter space and 
generating mostly unused results. 

Future work includes the application of the methods 
here to more and deeper optical data, e.g., the full SDSS 
photometric database, the 2QZ, 2SLAQ, SDSS Southern, 
VVDS, DEEP2, and COSMOS surveys, and the addi- 
tion of infrared data via UKIDSS and S-COSMOS. For 
quasars, a further useful addition would be a filter set 
that is customized for quasars rather than the stars and 
galaxies typical of optical surveys, or more and narrower 
filters, as in COMBO-17 (e.g., Wolf et al.), but for a 
wider field. The filters should overlap to minimize errors 
from line movement across filter edges. 

TABLE 1 

Photometric redshift PDF statistics. For quasars, the improvements resulting from cross-matching to the GALEX UV bands are clear. For SDSS 
main sample galaxies (MSGs) and luminous red galaxies (LRGs), the cross-match produces little improvement, as expected. Quoted errors are the 
standard deviation from ten-fold repeated holdout validation. The Δz (|zspec − zphot|) thresholds are 0.01, 0.02, and 0.03 for MSGs and LRGs, and 
0.1, 0.2, and 0.3 for quasars. 

87.3 ± 2.7 

GALEX-SDSS-only   QSO   FUV+NUV    933 ± 20    0.143 ± 0.031   81.5 ± 1.2   97.2 ± 0.7   98.8 ± 0.4 
GALEX-SDSS-only   QSO   NUV only   1542 ± 18   0.123 ± 0.015   77.1 ± 0.7   96.7 ± 0.4   99.1 ± 0.2 